CTREE, 1 June 2017

Acknowledgments

  • BIG thank you to Project TIER and the Alfred P. Sloan Foundation

What will we talk about?

  • How we have used R Markdown in our empirical courses
  • Reproducibility with R Markdown
  • And some things that we think are just cool in Rmd.

How we have used R Markdown

Michael:

Senior thesis seminar with 9 students

  • Very little background
  • All stages in R Markdown
    • Data manipulation, visualization, analysis
    • Presentations
    • Final paper

How we have used R Markdown

Michael:

Advantages:

  • To students
    • Only one environment to deal with for everything
  • To me
    • Full reproducibility of all 9 papers (one award)
    • Much easier to reproduce in one document
    • Very professional appearance (with template)

Cost

  • Somewhat higher startup cost in teaching them R and RStudio

How we have used R Markdown

Aaron:

  • Create slides in an Econometrics course
  • Final project in senior research seminar
  • Create custom progress reports for students
  • Interactive shiny apps for micro students

R Markdown & Reproducibility

  • What happens in a traditional research report?
  • Are traditional research reports easily reproducible?
  • What gives us soup-to-nuts reproducibility?
  • Answer: R Markdown, or scripted LaTeX + Stata (StatTag?)
  • Works within the TIER framework
  • I've implemented a variation of this in an upper-level economics elective

Questions on your notecard

Please take a couple of minutes to write down a question you have about R Markdown and Reproducibility.

Think about the following ideas:

  1. What problem could R Markdown (and RStudio) solve for me?
  2. What is something I'm accustomed to doing that I would want R Markdown (and RStudio) to do for me?
  3. What is an expectation I have about what I want Simon et al. to cover?

Just take a moment to write down something along these lines for now and put aside your notecard.

Traditional Reports

Courtesy of Bray, 2016

The Good

  • familiar format, e.g. Word
  • easy learning curve

The Bad

  • tough for reproducibility
  • difficult to update
  • mistakes crop up
  • teams can't collaborate easily

The Ugly?

  • Word/GDocs = Ugly?

R Markdown Report/Notebook?

Courtesy of Bray, 2016

The good

  • easy to reproduce
  • easy to edit/update
  • easy to collaborate
  • standardized & fast

The bad

  • students must learn syntax
  • code must be error-free for the document to compile

The ugly?

  • inequality in student backgrounds

Text Formatting

# Header 1

## Header 2

### Header 3

This is normal sized text used in the body of our work. 

For bullet points, we use dashes, e.g. 

- Intro to RStudio
- More content
  - a sub-point
- Back to the original level

Document Types

R Markdown can produce a variety of document types (other than the default HTML page):

  • pdf_document makes a PDF with LaTeX (.pdf)

  • word_document for Microsoft Word documents (.docx).

  • odt_document for OpenDocument Text documents (.odt).

  • rtf_document for Rich Text Format documents (.rtf)

And others.
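As a rough sketch of how one source file can feed several of these formats, the example below writes a throwaway .Rmd file and renders it with `rmarkdown::render()`. The file contents and variable names are illustrative. `pdf_document` is left out only because it also needs a LaTeX installation. Rendering assumes the rmarkdown package and pandoc are available, so the call is guarded.

```r
# Three backticks, built programmatically so this example is easy to embed
fence <- strrep("`", 3)

# Write a tiny R Markdown file to a temporary location (illustrative content)
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Demo report\"",
  "output: html_document",
  "---",
  "",
  paste0(fence, "{r}"),
  "summary(cars)",
  fence
), rmd_path)

# Formats named above; each is a function in the rmarkdown package
formats <- c("word_document", "odt_document", "rtf_document")

# Rendering needs rmarkdown and pandoc, so check before calling render()
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

outputs <- character(0)
if (can_render) {
  outputs <- vapply(formats, function(fmt) {
    rmarkdown::render(rmd_path, output_format = fmt, quiet = TRUE)
  }, character(1))
  print(basename(outputs))
}
```

In day-to-day use the same choice is usually made in the `output:` field of the YAML header, or from the Knit menu in RStudio.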

Presentation Types

R Markdown can also be repurposed to produce a presentation file (as with this presentation):

  • ioslides_presentation opens in your browser and is interactive (.html)

  • slidy_presentation is another browser-based presentation format (.html)

  • beamer_presentation makes a PDF with LaTeX (.pdf)

Data work

Think about data analysis as falling into three loose categories:

  • management & wrangling
  • visualization & summary statistics
  • modeling & inference

All of this occurs in the code "chunk"
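As a rough sketch of how the three categories might look inside a single chunk, the example below uses R's built-in `cars` data. The derived column `speed_kmh` and the speed cutoff are illustrative choices, not something taken from the courses.

```r
# 1. Management & wrangling: add a derived column, then filter rows
cars_fast <- subset(
  transform(cars, speed_kmh = speed * 1.609344),  # mph -> km/h
  speed >= 10                                     # illustrative cutoff
)

# 2. Visualization & summary statistics
print(summary(cars_fast$dist))
plot(dist ~ speed, data = cars_fast,
     xlab = "Speed (mph)", ylab = "Stopping distance (ft)")

# 3. Modeling & inference: simple linear regression
fit <- lm(dist ~ speed, data = cars_fast)
print(coef(summary(fit)))
```

When this runs inside an R Markdown chunk, the printed summaries and the plot are embedded in the rendered document directly below the code.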

Code chunks

  • To open a code chunk hit CMD + OPTION + I on a Mac (CTRL + ALT + I on Windows or Linux)

  • Or type out three backticks ``` followed by {r}

  • And then three more backticks ``` on another line.

  • Within the {r} you can specify options, like {r, eval = FALSE} if you don't want it to evaluate the code

  • Or you can label the code chunk, e.g. {r cars} labels the chunk "cars" in RStudio's chunk navigation menu

Code Chunk: Example

```{r cars, echo = TRUE}
summary(cars)
```

The option echo = TRUE means that the code gets included in the rendered HTML.

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00
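To see how the `echo` and `eval` options change what ends up in the output, here is a minimal sketch that knits three small chunks from a character vector with `knitr::knit(text = ...)`. It assumes the knitr package is installed. The chunk labels (`shown`, `hidden`, `skipped`) are made up for the illustration.

```r
library(knitr)

# Three backticks, built programmatically so this example is easy to embed
fence <- strrep("`", 3)

rmd <- c(
  paste0(fence, "{r shown, echo = TRUE}"),    # code and output both appear
  "x <- 1 + 1",
  "x",
  fence,
  "",
  paste0(fence, "{r hidden, echo = FALSE}"),  # output appears, code is hidden
  "x * 10",
  fence,
  "",
  paste0(fence, "{r skipped, eval = FALSE}"), # code appears, but is never run
  "stop('never runs')",
  fence
)

# knit() returns the resulting markdown as a single string
md <- knit(text = rmd, quiet = TRUE)
cat(md)
```

The `skipped` chunk contains a call to `stop()`, yet knitting succeeds, because `eval = FALSE` prints the code without running it.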

Slide with Plot (Reproduction of Sutter, 2009)

Dynamic Graphs

Plotly Graphs

Alter and check some data

##   session subject  r1  r2  r3  r4  r5  r6  r7  r8  r9  treatment team
## 1       1       1   0   0   0   0  10  10   0   0   0 individual   NA
## 2       1       2   0   0  30  40  40   0   0   0  20 individual   NA
## 3       1       3  30  30   0   0   0  60  60  10   0 individual   NA
## 4       1       4  20   0 100   0   0  30  75 100 100 individual   NA
## 5       1       5 100 100 100 100 100 100 100 100 100 individual   NA
## 6       1       6 100 100 100 100 100 100 100   0   0 individual   NA
##           uniqid
## 1 1_individual_1
## 2 1_individual_2
## 3 1_individual_3
## 4 1_individual_4
## 5 1_individual_5
## 6 1_individual_6
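The regressions later in the deck use a `value ~ treatment` formula, which suggests the round columns (r1 to r9) were stacked into a single `value` column at some point. The slides do not show that step. The sketch below is one way it could be done, using base R's `reshape()` on made-up toy data in the same wide layout; the column name `round` is an assumption.

```r
# Toy data in the same wide layout as the slide (values are made up)
wide <- data.frame(
  uniqid    = c("1_individual_1", "1_individual_2"),
  treatment = c("individual", "individual"),
  r1 = c(0, 30),
  r2 = c(10, 40),
  r3 = c(20, 50)
)

# Stack r1..r3 into one `value` column: one row per subject and round
long <- reshape(
  wide,
  direction = "long",
  varying   = c("r1", "r2", "r3"),
  v.names   = "value",
  timevar   = "round",   # hypothetical name for the round index
  times     = 1:3,
  idvar     = "uniqid"
)
rownames(long) <- NULL

# Check the result: 2 subjects x 3 rounds = 6 rows
print(long[order(long$uniqid, long$round), ])
```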

Statistical Tests

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  value by treatment
## W = 52876, p-value = 3.838e-10
## alternative hypothesis: true location shift is not equal to 0
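The slide shows only the test output. A test of this kind is typically requested with a formula whose grouping variable has exactly two levels. Since the experimental data are not included here, the sketch below uses the built-in `mtcars` data as a stand-in.

```r
# Two-group comparison on built-in data: mpg by transmission type (am = 0 or 1).
# exact = FALSE requests the normal approximation, which copes with tied values.
test <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)
print(test)

# Individual pieces of the result can be pulled out for in-line reporting
print(test$p.value)
```

Pulling `test$p.value` out of the result is what makes it possible to report the number in the text of an R Markdown document without retyping it.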

Regression output

## 
## Call:
## lm(formula = value ~ treatment, data = SutNarrow)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.370 -29.385  -0.542  38.630  60.615 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          39.385      1.451  27.152  < 2e-16 ***
## treatmentmessage     21.985      1.994  11.028  < 2e-16 ***
## treatmentmixed       10.609      1.925   5.510 3.92e-08 ***
## treatmentpaycomm     10.886      2.144   5.077 4.09e-07 ***
## treatmentteamtreat   16.313      2.629   6.204 6.34e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34.81 on 2713 degrees of freedom
## Multiple R-squared:  0.04473,    Adjusted R-squared:  0.04333 
## F-statistic: 31.76 on 4 and 2713 DF,  p-value: < 2.2e-16
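With a categorical regressor such as `treatment`, `lm()` reports each level relative to an omitted baseline level (presumably the individual treatment here, since it has no coefficient of its own). The sketch below shows that relationship on the built-in `PlantGrowth` data, used as a stand-in for the experimental data.

```r
# Built-in data: plant weight under a control and two treatments
fit <- lm(weight ~ group, data = PlantGrowth)
print(coef(fit))

# Group means, for comparison with the coefficients
means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
print(means)

# Intercept = mean of the baseline level ("ctrl");
# every other coefficient = that group's mean minus the baseline mean
print(means["trt1"] - means["ctrl"])
```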

Or a Panel Regression

## Oneway (time) effect Random Effect Model 
##    (Swamy-Arora's transformation)
## 
## Call:
## plm(formula = value ~ treatment, data = SutNarrow, effect = "time", 
##     model = "random", index = c("uniqid"))
## 
## Balanced Panel: n=302, T=9, N=2718
## 
## Effects:
##                    var  std.dev share
## idiosyncratic 1197.631   34.607 0.985
## time            18.044    4.248 0.015
## theta:  0.5755  
## 
## Residuals :
##     Min.  1st Qu.   Median  3rd Qu.     Max. 
## -61.9533 -28.6148  -2.6745  33.8010  64.4377 
## 
## Coefficients :
##                    Estimate Std. Error t-value  Pr(>|t|)    
## (Intercept)         39.3854     2.0207 19.4914 < 2.2e-16 ***
## treatmentmessage    21.9850     1.9815 11.0951 < 2.2e-16 ***
## treatmentmixed      10.6093     1.9137  5.5437 3.246e-08 ***
## treatmentpaycomm    10.8862     2.1313  5.1079 3.484e-07 ***
## treatmentteamtreat  16.3130     2.6134  6.2420 4.996e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    3402300
## Residual Sum of Squares: 3248300
## R-Squared:      0.045255
## Adj. R-Squared: 0.043848
## F-statistic: 32.1494 on 4 and 2713 DF, p-value: < 2.22e-16

Even Fancy Regression Output

                      Dependent variable:
                      value
treatmentmessage      21.985***
                      (1.994)
treatmentmixed        10.609***
                      (1.925)
treatmentpaycomm      10.886***
                      (2.144)
treatmentteamtreat    16.313***
                      (2.629)
Constant              39.385***
                      (1.451)
Observations          2,718
R²                    0.045

Math?

How about Bayes' Rule?

\[\Pr(\mbox{Outcome} | \mbox{signal}) = \frac{\theta p}{\theta p + (1 - \theta)(1 - p)}\]

R Markdown uses \(\LaTeX\) for math, and RStudio displays it immediately.

That is, \(\LaTeX\) without the challenges of learning the packages, tables, etc. that make learning \(\LaTeX\) so hard.

In-line equations are bracketed by single dollar signs $.

Off-set equations are bracketed by double dollar signs $$.
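As a numeric check on Bayes' rule, here is a minimal sketch in R. The slide does not define the symbols, so the code assumes that θ is the probability of seeing the signal when the outcome is true (and 1 - θ when it is not), and that p is the prior probability of the outcome. The denominator adds up the two ways the signal can occur.

```r
# Posterior probability of the outcome after observing the signal.
# Assumed meanings: theta = P(signal | outcome), P(signal | no outcome) = 1 - theta,
#                   p     = prior P(outcome)
posterior <- function(theta, p) {
  theta * p / (theta * p + (1 - theta) * (1 - p))
}

# An uninformative signal (theta = 0.5) leaves the prior unchanged
print(posterior(theta = 0.5, p = 0.2))

# A 90%-accurate signal moves a 20% prior up substantially
print(posterior(theta = 0.9, p = 0.2))
```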

What else?

R Markdown and RStudio together have excellent capabilities.

  • RStudio can show you the output of the commands within the R Markdown file
  • RStudio has error detection and debugging assistance for your code (unlike, e.g., Stata or aspects of Excel)
  • RStudio Server can be hosted online, so your students can log in and work there

Lessons from experience

Michael:

Students will only learn commands through graded assignments

Aaron:

Students can struggle with basic computing (working directory, etc.)

R Link Love?

Notecards again…

  • Go back to your notecards
  • Re-read them
  • Did I cover what you wanted me to cover?
  • Do you have other or new concerns or questions? (write them on the back)
  • Chat to someone next to you about what you're thinking about.
  • Share with the workshop.
  • Yes, you're doing Think-Pair-Share…